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io Abstract. We propose a new method for detecting changes in Markov 

ii network structure between two sets of samples. Instead of naively fitting 

12 two Markov network models separately to the two data sets and figuring 

13 out their difference, we directly learn the network structure change by 
« estimating the ratio of Markov network models. This density-ratio formu- 
15 lation naturally allows us to introduce sparsity in the network structure 
i6 change, which highly contributes to enhancing interpretability. Further- 
17 more, computation of the normalization term, which is a critical compu- 
i8 tational bottleneck of the naive approach, can be remarkably mitigated. 
19 Through experiments on gene expression and Twitter data analysis, we 

►> ■ 20 demonstrate the usefulness of our method. 

m 

o 

00 ■ 2i 1 Introduction 

vq 

t^- ■ 22 Changes in the structure of interactions between random variables are interesting 

23 in many real-world phenomena. For example, genes may interact with each other 

24 in different ways when external stimuli change, co-occurrence between words may 

25 disappear/appear when the domains of text corpora shift, and correlation among 

26 pixels may change when a surveillance camera captures anomalous activities. 

27 Discovering such changes in interactions is a task of great interest in machine 

28 learning and data mining, because it provides useful insights into underlying 
C^ | 29 mechanisms in many real-world applications. 

30 In this paper, we consider the problem of detecting changes in conditional in- 

31 dependence among random variables between two sets of data. Such conditional 

32 independence structure can be expressed as an undirected graphical model called 

33 a Markov network (MN) |1I2I3| . where nodes and edges represent variables and 

34 their pairwise dependency. 

35 A naive approach to change detection in MNs is the two-step procedure of 

36 first estimating two MNs separately from two sets of data by maximum likelihood 

37 estimation (MLE), and then comparing the structure of learned MNs. However, 

38 MLE is often computationally expensive due to the normalization factor included 

39 in the density model. There are estimation methods which do not rely on knowing 
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Fig. 1. The rationale of direct structural change learning. 



40 the normalization factor [4], but Gaussianity is often assumed for computing 

4i the normalization factor analytically [5]. However, this Gaussian assumption is 

42 highly restrictive in practice. 

43 Another conceptual weakness of the above two-step procedure is that struc- 

44 ture change is not directly learned. This indirect nature causes a problem, for 

45 example, if we want to learn a sparse structure change. For learning sparse 

46 changes, we may utilize £i-regularized MLE [61718] . which produces sparse MNs 

47 and thus the change between MNs also becomes sparse. However, this approach 

48 docs not work if MNs are rather dense but change is sparse. 

49 To mitigate this indirect nature, the fused lasso [9] is useful, where two MNs 
so arc simultaneously learned with a sparsity-inducing penalty on the difference 

51 between two MN parameters [10]. Although this fuscd-lasso approach allows us 

52 to learn sparse structure change naturally, the restrictive Gaussian assumption 

53 is still necessary to obtain the solution in a computationally efficient way. 

54 A nonparanormal assumption [ll|12j is a useful generalization of the Gaus- 

55 sian assumption. A nonparanormal distribution is a semi-parametric Gaussian 
so copula where each Gaussian variable is transformed by a non-linear function. 

57 Nonparanormal distributions arc much more flexible than Gaussian distributions 

58 thanks to the feature-wise non-linear transformation, while the normalization 

59 factors can still be computed analytically. 

60 Thus, the fused-lasso method combined with nonparanormal models would 
6i be the state-of-the-art approach to change detection in MNs. However, the fused- 

62 lasso method is still based on separate modeling of two MNs, and its computation 

63 for more general non-Gaussian distributions is challenging. 

64 In this paper, we propose a more direct approach to structural change learn- 

65 ing in MNs based on density ratio estimation (DRE) |13j . Our method does not 

66 separately model two MNs, but directly models the change in two MNs. This 

67 idea follows Vapnik's principle |14) : 

68 If you possess a restricted amount of information for solving some prob- 

69 lem, try to solve the problem directly and never solve a more general 

70 problem as an intermediate step. It is possible that the available infor- 

71 mation is sufficient for a direct solution but is insufficient for solving a 

72 more general intermediate problem. 

73 This principle was used in the development of support vector machines (SVMs): 

74 Rather than modeling two classes of samples, SVM directly learns a decision 

75 boundary that is sufficient for performing pattern recognition. In the current 

76 context, estimating two MNs is more general than detecting changes in MNs 



77 (Figured]). This direct approach means that we halve the number of parameters, 

78 from two MNs to one MN-diffcrcncc. 

79 Furthermore, the normalization factor in our DRE-based method can be 
so approximated efficiently, because the normalization term in a density ratio func- 
si tion takes the form of an expectation and thus it can be simply approximated 

82 by sample averages without sampling. 

83 The remainder of this paper is structured as follows. In Section [3J we for- 

84 mulatc the problem of detecting structural changes and review currently avail- 

85 able approaches. We then propose our DRE-based structural change detection 
se method in Section [3] Results of illustrative and real- world experiments are re- 
87 ported in Section @] and Section [5j respectively. Finally, we conclude our work 
as and show future directions in Section [B] 



2 Problem Formulation and Related Methods 

In this section, we formulate the problem of change detection in Markov network 
structure and review existing approaches. 



92 2.1 Problem Formulation 

Consider two sets of samples drawn separately from two probability distributions 
P and Q on R d : 

{xf }- ^p{x) and {*?}£ S ? (x). 

We assume that p and q belong to the family of Markov networks (MNs) con- 
sisting of univariate and bivariate factors: 



1 / d d \ 

P(x\ a) = -7^-r exp Y^ a i9i(%i) + ^Z a l 3 9 l , 3 (x l ,x j ) 



(1) 



93 where x = (xi, . . . , Xd) T , tXi t tXij are parameters, g il g i j are univariate and 

94 bivariate vector-valued basis function, and Z(a) is the normalization factor. 

95 q(x) is defined in the same way. 

For notational simplicity, we unify both univariate and bivariate factors as 

p(x;0) = ^-exp(y / 0jf t (x)Y where Z{9) = J exp (j^Oj f t (x) j daj. 

96 q(x) is also simplified in the same way. 

97 Our goal is to detect the change in conditional independence between random 

98 variables from P to Q. 



99 2.2 Sparse MLE and Graphical Lasso 

Maximum likelihood estimation (MLE) with group £i-regularization has been 
widely used for estimating the sparse structure of MNs |15|16|8j : 

>: 
rnax^logp(:zf;0)-A^||0 t ||, (2) 

i=i t 

ioo where || • || denotes the ^2-norm. As A increases, Qt for pairwise factors may drop 

ioi to 0. Thus, this method favors an MN that encodes more conditional indepen- 

102 dencies among variables. For computing the normalization term Z(0) in Eq.([T]), 

103 sampling techniques such as Markov-chain Monte-Carlo (MCMC) and impor- 

104 tance sampling arc usually employed. However, obtaining a reasonable value 

105 by these methods becomes computationally more expensive as the dimension d 
loe grows. 

To avoid this computational problem, the Gaussian assumption is often im- 
posed [7|17j . If we consider a zero- mean Gaussian distribution, the following 
p(x; 0) can be used to replace the density model in Eq.@: 

dct(e) 1 / 2 / i T 

(27r)d/2 exp(-2* 0x 

where is the inverse covariance matrix (a.k.a. the precision matrix) and det(-) 
denotes the determinant. Then is learned by 

m_axlogdet(6>) - tr(0S p ) - X\\0\\i, 

107 where S is the sample covariance matrix of {xf }™ =1 . ||@||i is the li-norm 

los of 0, i.e. the absolute sum of all elements. This formulation has been studied 

109 intensively in [6], and a computationally efficient solution called the graphical 

no lasso [7] has been proposed. 

in Sparse changes in conditional independence structure between P and Q can 

n2 be detected by comparing two MNs separately estimated using sparse MLE. 

n3 However, this approach implicitly assumes that two MNs are sparse, which is 

n4 not necessarily true even if the change is sparse. 



2.3 Fused-Lasso Method 

To more naturally handle sparse changes in conditional independence structure 
between P and Q, a method based on fused lasso [9] has been developed [TO] . This 
method jointly maximizes the conditional likelihood in a feature- wise manner for 
P and Q with a sparsity penalty on the difference between parameters. More 
specifically, for each element x s (s = 1, . . . , d) of x, 

max ef (Of) + tf (0?) - Ax(||flf ||i + ||fl?||i) - A 2 ||flf - 0?||i, 
e p ,e? 



5 

where if (9) is the log conditional likelihood for the s-th element x s € M. given 
the rest a;_ s G R d_1 : 

1=1 

i6 if (9) is defined in the same way as if (9). In this fused-lasso method, Gaus- 
17 sianity is usually assumed to cope with the normalization issue described in 
Section [ 



118 



2.4 Nonparanormal Extensions 

In the above methods, Gaussianity is required in practice to compute the nor- 
malization factor efficiently, which is a highly restrictive assumption. 

To overcome this restriction, it has become popular to perform structure 
learning under the nonparanormal settings [11I12J , where the Gaussian distribu- 
tion is replaced by a semi-parametric Gaussian copula, x = [x\, . . . , xj) 1 is said 
to follow a nonparanormal distribution, if there exists a set of monotone and 
diffcrcntiable functions, {hi(x)}f =1 , such that h(x) = (hi(x^), . . . ,hd(x^)) T 
follow the Gaussian distribution. Nonparanormal distributions are much more 
flexible than Gaussian distributions thanks to the non-linear transformation 
{hi(x)}f =1 , while the normalization factors can still be computed in an ana- 
lytical way. 

The fused-lasso method can more naturally handle sparse changes in MNs 
than separate sparse MLE, and its nonparanormal extension is more flexible 
than the Gaussian counterpart. However, the fused-lasso method is still based 
on separate modeling of two MNs, and its computation for more general non- 
Gaussian distributions is challenging. 



136 3 Direct Learning of Structural Changes via Density 

137 Ratio Estimation 

138 In this section, we propose to directly learn structural changes based on density 

139 ratio estimation |13j . which does not involve separately modeling each MN and 
mo which allows us to approximate the normalization term efficiently. 

wi 3.1 Density Ratio Formulation for Structural Change Detection 

142 Our key idea is to consider the ratio of p and q: 



p(x-9 p 



^««p(5>f-fl?) T /«(« 



q(x\0") 



Here t — 0^ encodes the difference between P and Q for factor f t , i.e. t — 0± 
is zero if there is no change in the t-th factor. 

Once we consider the ratio of p and q, we actually do not have to estimate 
t and 0j ; instead an estimate of their difference 0± = 9 t — ff{ is sufficient for 
change detection: 

r(x ] d) = 1 ^exp\y2ejMx)\, where N(6) = Jq(x)ex P r£eJ f t (x)j. (3) 

The normalization term N(6) guarantee^] J q(x)r(x; 8) Ax = 1. Thus, in this 
density ratio formulation, we are no longer modeling each p and q separately, 
but we model the change from p to q directly. This direct nature would be 
more suitable for change detection purposes according to Vapnik's principle that 
encourages avoidance of solving more general problems as an intermediate step 
|14j . This direct formulation also allows us to halve the number of parameters 
from both P and Q to only 0. 

Furthermore, the normalization factor N(0) in the density ratio formulation 

can be easily approximated by sample average over {xf}™2i ~ q{x), because 
N(8) is the expectation over q(x): 






3.2 Direct Density-Ratio Estimation 

Density ratio estimation (DRE) methods have been recently introduced to the 
machine learning community [13] and are proven to be useful in a wide range of 
applications. Here, we concentrate on a DRE method called the Kullback-Leibler 
importance estimation procedure (KLIEP) for a log-linear model J18I19J . 

For a density ratio model r(x;0), the KLIEP method minimizes the 
Kullback-Leibler divergence from p(x) to p(x) = q(x)r(x; 9): 

KL[p\\p}= I ' p{x) log (a VV . e) dx = Const. - f p(x)logr(x;6). (4) 

Note that our density-ratio model ([3]) automatically satisfies the non-negativity 
and normalization constraints: 

r(x; 6) > and / q(x)r(x\ 6) dx = 1. 



4 An alternative normalization term N'(0, 0®) = J q(x; 0®)r(x; 0)dx may also be 
considered. However, the expectation with respect to a model distribution can be 
computationally expensive as in the case of MLE, and this alternative form requires 
an extra parameter 0® which is not our main interest. It is note worthy that the 
use of N(0) as a normalization factor guarantees the consistency of density ratio 
estimation 1181. 



7 

In practice, we maximize the empirical approximation of the second term in the 
right-hand side of Eq.(j4|): 

1 " F . 
*kliep(0) = — V log r(x p ;9) 
np ti 

Lf^^^D-log^fexp^^^)). 



up 



Because ^kliep(^) is convex with respect to 9, its global maximizer can be 
numerically found by standard optimization techniques such as gradient ascent 
or quasi-Newton methods: The gradient of ^kliep with respect to 9t is given by 



1 - £ ES exp fc »7/«(«?) /«(x?) 

Ve t %LiEP = — V /*(*< ) 

n 7-. fc * 



' np ^ l ^E^exp(E t ^/t(^)) 



157 3.3 Sparsity-Inducing Norm 

158 To find a sparse change in P and Q, we may regularize our KLIEP solution with 

159 a sparsity-inducing norm Et ll^tll- Note that the motivation for introducing 
wo sparsity in KLIEP is different from MLE. In the case of MLE, both 9 P and Q 
lei arc sparsified and then consequently the difference 9 — 9 ^ is also sparsificd. 

162 On the other hand, in our case, only the difference 9 — 9® is sparsified; thus 

163 our method can still work well even if 9 and 9® are dense. 

In practice, we may use the following elastic-net penalty [20] to better control 
overfitting to noisy data: 



max ^KLiEp(0)-A 1 ||0|| 2 -A 2 ^||0 t | 

L t 

where \\9\\ 2 penalizes the magnitude of the entire parameter vector. 



165 4 Numerical Experiments 

166 In this section, we compare the proposed KLIEP-based method with the Fuscd- 

167 lasso (Flasso) method [10] and the Graphical- lasso (Glasso) method [7]. Results 

168 are reported on datasets with three different underlying distributions: Multi- 

169 variatc Gaussian, nonparanormal, and a non-Gaussian "diamond" distribution. 
i7o Considering that the artificial datasets are noise free, we set Ai = for KLIEP 

171 and Flasso. 

172 4.1 Performance Metrics, Model Selection, and Basis Function 

173 By taking the advantage of knowing the ground truth of structural changes in 

174 artificial experiments, we measure the performance of change detection methods 



175 using the precision-recall (P-R) curve. For KLIEP and Flasso, a precision and 

176 recall curve can be plotted by varying group-sparsity control parameter A2 . For 

177 Glasso, we vary the sparsity control parameters as A = A = A®. 

To perform the model selection, we use the log-likelihood of an estimated 
density ratio on a hold-out dataset, which we refer to as hold-out log-likelihood 
(HOLL). More precisely, given two sets of hold-out data \x i }^f 1 ~ P and 
{^i }"=i ~ Qi we use the following quantity: 

1 ^ exp(£ t t T / 4 (2f) 

feoLL = — > _, log — 

np *-i 1 



np . 
1— 1 



£ESi«p(E,3«/«(2?) 



178 
179 



In case such a hold-out dataset is not available, the cross-validated log-likelihood 
(CVLL) may be used instead. 

For the Glasso and Flasso methods, we perform model selection by adding 
the hold-out/cross-validatcd likelihoods on p(x; 0) and q(x; 6) together: 

. np , ™Q 

^X>gp(5-f;e ) + ^^log 9 (Sf;0 Q ). 
np f—' no f-f 



; = 1 



n «S 



180 As basis functions, we consider two types of ft'. A power nonparanormal / npn 

i8i and a polynomial transform / po iy 

The pairwise nonparanormal transform with power k is defined as 

fn P n(xi,Xj) := [sign(xj)a: l fe sign(x :/ )a;)',l]. 

182 This transforms the original data by the power of k, so that the transformed data 

183 are jointly Gaussian (see Section [4. 3[) . The univariate nonparanormal transform 

184 is defined as / n pn(a*) := fn P n(xi,Xi). 

The polynomial transform up to degree of k is defined as: 

Jpoly\Xi 7 Xj) '.= \Xi , Xj , X{Xj , . . . , 27j Xj , X^ ,Xj , . . . , Xi, Xj, 1J 

185 The univariate polynomial transform is defined as / po iy(^i) " fpoiy(xi, 0). 

186 4.2 Multivariate Gaussian 

is? First, we investigate the performance of each learning method under Gaussianity. 
Consider a 40-node sparse Gaussian MN, where its graphical structure is 
characterized by precision matrix with diagonal elements equal to 2. The 
off-diagonal elements are randomly choscnj and set to 0.2, so that the overall 
sparsity of is 25%. We then introduce changes by randomly picking 15 edges 
and reducing the corresponding elements in the precision matrix by 0.1. The 
resulting precision matrices and 0® are used for drawing samples as 

{xnU-mO,^)- 1 ) and {xf}U~M(0,(GQ)-'). 
5 We set &ij — &j : i for not breaking the symmetry of the precision matrix. 
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(a) KLIEP, n = 100 



(b) Flasso, n = 100 
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(g) Gaussian distribution (h) P-R curve, n — 100 (i) P-R curve, n = 50 

Fig. 2. Experimental results on the multivariate Gaussian dataset. 

Datasets of size n = 50 and n = 100 are tested. 

We repeat the experiments 20 times with randomly generated datasets and 
report the results in Figure [U The top 6 graphs are examples of regularization 
paths (color represents the ground truth) and the bottom 3 graphs are the data 
generating distribution and averaged P-R curves with standard error. The top 
row is for n = 100 while the middle row is for n = 50. The regularization 
parameters picked by the model selection procedures described in Section 14.11 
are marked with blue vertical lines. In this experiment, the Gaussian model (the 
nonparanormal basis function with power k = 1) is used. Because the Gaussian 
model is also used in Flasso and Glasso, the difference in performance is caused 
only by the difference of estimation methods. 

When n = 100, KLIEP and Flasso clearly distinguish changed (black) and 
unchanged (red) edges in terms of parameter magnitude. However, when the 
sample size is halved, the separation is visually rather unclear in the case of 
Flasso. In contrast, the paths of changed and unchanged edges are still almost 
disjoint in the case of KLIEP. The Glasso method performs rather poorly in 
both cases. A similar tendency can be observed also in the averaged P-R curve. 
When the sample size is 100, KLIEP and Flasso work equally well, but KLIEP 
gains its lead when the sample size is reduced. Glasso does not perform well in 
both cases. 
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208 4.3 Nonparanormal 

We post-process the dataset used in Section 14.21 to construct nonparanormal 
samples: simply, we apply the power function 

h~ l {x) = sign(x)|a;| T 
to each dimension of x p and x®, so that h(x p ) ~ A/"(0, (0 )~ 1 ) and h(x®) ~ 

Af(o, (e^)- 1 ). 

In order to cope with the non-linearity, we apply the nonparanormal basis 

212 function with power 2, 3 and 4 and choose the one that maximizes the peak 

213 HOLL value. For Flasso and Glasso, we apply the nonparanormal transform 

214 described in [TT] before the structural change is learned. 

215 The experiments are conducted on 20 randomly generated datasets with n = 

216 50 and 100, respectively. The regularization paths, data generating distribution, 
21? and averaged P-R curves are plotted in Figure [3] The results show that Flasso 

218 clearly suffers from the performance degradation compared with the Gaussian 

219 case. Due to the two-step estimation scheme, the performance of Glasso is poor. 

220 In contrast, KLIEP separates changed and unchanged edges still clearly for both 
n = 50 and n = 100. The P-R curves also show the same tendency. 



222 4.4 "Diamond" Distribution with No Pearson Correlation 

223 In the previous experiment, though samples are non-Gaussian, the Pearson cor- 

224 relation is not zero. Therefore, methods assuming Gaussianity can still capture 

225 the linear correlation between random variables. In this experiment, we consider 

226 a more challenging case with a diamond-shaped distribution within the expo- 

227 ncntial family that has zero Pearson correlation coefficient between dependent 

228 variables. Thus, the methods assuming Gaussianity (i.e., Glasso and Flasso) can 

229 not extract any information in principle from this dataset. 

The probability density function of the diamond distribution is defined as 



follows (Figure 4(a) I 



p(x) oc exp - J2 2a=? - £ 20^ , (5) 

V i (i,j):Ai,^0 / 

where the adjacency matrix A describes an MN structure. Note that this distri- 
bution can not be transformed into a Gaussian distribution. Samples from the 
above distribution are drawn by using a slice sampling method [21j . However, 
since generating samples from a high-dimensional distribution is non-trivial and 
time-consuming, we focus on a relatively low-dimensional case to avoid sampling 
errors which may mislead the experimental evaluation. 

We set d = 9 and np = uq = 5000. A p is randomly generated with 35% 
sparsity, while A^ is created by randomly removing edges in A so that the 
sparsity level is dropped to 15%. 
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(c) Glasso, n — 100 




(f) Glasso, 71 = 50 
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Fig. 3. Experimental results on the nonparanormal dataset. 
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In this experiment, we compare the performance of all three methods with 
their available transforms: KLIEP (/ po i y ), KLIEP (/„ pn ,fc = 2,3,4), KLIEP 
(/npn, k — 1; same as the Gaussian model), Flasso (Gaussian), Flasso (nonpara- 
normal), Glasso (Gaussian), and Glasso (nonparanormal). The averaged P-R 
curves are shown in Figure 4(c) As expected, except KLIEP (/ po i y ), all other 
methods do not work properly. This means that the polynomial kernel is indeed 
very helpful in handling completely non-Gaussian data. However, as discussed 
in Section 12.21 it is difficult to use such a kernel in the MLE-based approaches 
(Glasso and Flasso) because computationally demanding sampling is involved 
in computing the normalization term. The regularization path of KLIEP (/ po i y ) 
illustrated in Figure [4(b) | shows the usefulness of the proposed method in change 
detection under non-Gaussianity. 



251 5 Applications 

252 In this section, experiments arc conducted on a synthetic gene expression dataset 

253 and a Twitter dataset, respectively. We consider only the KLIEP and Flasso 
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(a) Diamond distribution 



(b) KLIEP 



(c) P-R curve 



Fig. 4. Experimental results on the diamond dataset. "NPN" and "POLY" denote the 
nonparanormal and polynomial models, respectively. Note that the precision rate of 
100% recall for a random guess is approximately 20%. 



methods here. For KLIEP, the polynomial transform function with k £ {2,3,4} 
is used. The parameter Ai in KLIEP and Flasso is tested with choices Ai £ 
{0.1,1,10}. The performance reported in two experiments in this section are 
obtained using the models selected by HOLL and 5- fold CVLL (see Section |4~TJ) . 
respectively. 



5.1 Synthetic Gene Expression Dataset 



A gene regulatory network encodes interactions between DNA segments. How- 
ever, the way genes interact may change due to environmental or biological 
stimuli. In this experiment, we focus on detecting such changes. We use Syn- 
TReN, which is a generator of gene regulatory networks used as the benchmark 
validation of bioinformatics algorithms |22j . 

To test the applicability of the proposed method, we first choose a sub- 
network containing 13 nodes from an existing signalling network in Saccha- 



romyces cerevisiae (shown in Figure 5(a) ). Three types of interactions are mod- 
elled: activation (ac), deactivation (re), and dual (du). 50 samples are generated 
in the first stage, after which we change the types of interactions in 6 edges, and 
generate 50 samples again. Four types of changes are considered in such case: ac 
— > re, re — > ac, du — > ac, and du — > re. 



The rcgularization paths for KLIEP and Flasso are plotted in Figures 5(b) 
and 5(d) Averaged precision-recall curves over 20 simulation runs are shown 
Clearly from the example of KLIEP regularization paths shown 
the magnitude of estimated parameters on the changed pairwise 



5(c) 



5(d) 



in Figure 

in Figure 

interactions is much higher than that of the unchanged ones, and hits zero only at 

the final stage. On the other hand, Flasso gives many false alarms by assigning 

non-zero parameters to the unchanged interactions, even after some changed 



279 ones hit zeros. Reflecting a similar pattern, the P-R curves plot in Figure 5(c) 



show that the proposed KLIEP method achieves significant improvement over 
the Flasso method. 



13 



Changed 
Unchanged 
X picked 




(a) Gene regulatory network 



(b) KLIEP 



KLIEP 

Flasso 




0.2 0.4 0.6 0.E 

recal 




10' 

(d) Flasso 



(c) P-R curve 
Fig. 5. Experiments on synthetic gene expression datasets. 



282 5.2 Twitter Story Telling 
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In this experiment, we use KLIEP and Flasso as event detectors from Twitter. 
More specifically, we choose the Deepwater Horizon oil spill |j as the target 
event, and we hope that our method can recover some story lines from Twitter 
as the news event develops. Counting the frequencies of 10 keywords (BP, oil, 
spill, Mexico, gulf, coast, Hayward, Halliburton, Transoccan, and Obama), we 
obtain a dataset by sampling 1061 times (4 per day), from February 1st, 2010 
to October 15th, 2010. 

To conduct our experiments, we segment the data into two parts. The first 
300 samples which are collected before the day of oil spill (April 20th, 2010) are 
regarded as conforming to a 10-dimensional joint distribution Q, while the second 
set of samples that are drawn in an arbitrary 50-day window approximately after 
the event happened is regarded as following distribution P. 

The MN of Q encodes the original conditional independence of frequencies 
between 10 keywords, and the underlying MN of P has changed since the event 
occurred. Thus, unveiling a change in MNs between P and Q may recover popular 
topic trends on Twitter in terms of the dependency among keywords. 

The detected change graph on 10 keywords are illustrated in Figure [6] The 
edges are selected at a certain value of A2 indicated by the maximal CVLL. Since 
the edge set that is picked by CVLL may not be sparse in general, we sparsified 



' |http : //en . wikipedia . org/wiki/Deepwater_Horizon_oil_spiiT] 



14 






0.12 

08 
06 
0.04 
0.00 

(a) April 17th-June 5th, (b) June 6th-July 25th, (c) July 26th-Sept. 14th, 
KLTRP KT.TF.P KLIEP 

(hEjij) (ffJif) 

fsplf, 



(bp\ te*r^ 

(d) April 17th-June 5th, (e) June 6th-July 25th, (f) July 26th-Sept. 
Flasso Flasso 14th, Flasso 

Fig. 6. Change graphs captured by the proposed KLIEP method (top) and the Flasso 
method (bottom). The date range beneath each figure is when P was sampled, while 
the Q is fixed to dates from Feb 1st to April 17th. Notable structures shared by the 
graph of both methods are surrounded by the green dashed lines. Unique structures 
that only appear in the graph of proposed method are surrounded by the red dashed 
lines. 



the graph based on the permutation test as follows: we randomly shuffle the 
samples from P and Q and repeatedly run change detection algorithms for 100 
times; then we count the number of detected edges by CVLL. Finally, we select 
the edges that are detected using the original non-shuffled dataset and remove 
those that were detected in the shuffled datasets for more than 5 times. In 
Figure El we plot detected differential graphs which are generated using samples 
of P starting from April 17th, July 6th, and July 26th. 

The initial explosion happened on April 20th, 2010. Both methods dis- 
cover dependency changes between keywords. Generally speaking, KLIEP cap- 
tures more conditional independence changes between keywords than the Flasso 
method, especi ally when compar ing Figure 6(c) and Figure [6 (f)| At the first two 
stages (Figures l6(a)l . l6(bl 16(d)! and l6(e)l ). the keyword "Obama" is very well 
connected with other keywords in the results given by both methods. Indeed, at 
the early development of this event, he lies in the center of the news stories, and 
his media exposure peaks after his visit to the Louisiana coast (May 2nd, May 
28nd, and June 5th) and his meeting with BP CEO Tony Hayward on June 16th. 
Not ably, both methods highlight the "gulf-obama-coast" trian gle i n Fig ures 16(a)! 
and 16(d)! and the "bp-obama- hayward" chain in Figures 16(b)! and 16(c) 



However, there are some important differences worth mentioning. First , the 



Fla sso m ethod misses the "transoccan-hayward-obama" triangle in Figures 16(d) 



322 and l6(c)l . Transocean is the contracted operator in the Dccpwater Horizon plat- 
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form, where the initial explosion happened. The chain "bp-spill-oil" may indi- 
cate that the phrase "bp spill" or "oil spill" has been publicly recognized by 
the Twitter community, while the "hayward-bp-mexico" triangle, yet relatively 
weak, may link to the event that Hayward steped down from the CEO position 
on June 27th. 



6 Conclusion and Future Work 

In this paper, we proposed a direct approach to learning sparse changes in MNs 
by density ratio estimation. Rather than fitting two MNs separately to data and 
comparing them to detect a change, we estimated the ratio of two MNs where 
changes can be naturally encoded as sparsity patterns in estimated parameters. 
Through experiments on artificial and real-world datasets, we demostrated the 
usefulness of the proposed method. 

Compared with the conventional two-stage MLE approach, a notable advan- 
tage of our method is that the normalization term in the density ratio model 
can be approximated by a sample average without sampling. This considerably 
loosens the restriction on applicable distributions. Moreover, thanks to its direct 
modeling nature with density ratios, the number of parameters is halved. 

We only considered MNs with pairwise factors in this paper. However, such 
a model may be misspecified when higher order interactions exist. For example, 
combination with the idea hierarchical log-linear model presented in |15| may 
lead to a promising solution to this problem, which will be investigated in our 
future work. 
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